METHOD OF, AND SYSTEM FOR, SCANNING ELECTRONIC DOCUMENTS
WHICH CONTAIN LINKS TO EXTERNAL OBJECTS



Introduction 

The present invention relates to a method of, and system for, replacing external 
links in electronic documents such as email with internal links. One use of this is to ensure
that email that attempts to bypass email content scanners no longer succeeds. Another use is 
to reduce the effectiveness of web bugs. 



Background 

Content scanning can be carried out at a number of places in the passage of
electronic documents from one system to another. Taking email as an example, it may be
carried out by software operated by the user, e.g. incorporated in, or an adjunct to, his email
client, and it may be carried out on a mail server to which the user connects, over a LAN or
WAN, in order to retrieve email. Also, Internet Service Providers (ISPs) can carry out
content scanning as a value-added service on behalf of customers who, for example, then
retrieve their content-scanned email via a POP3 account or similar.

One trick which can be used to bypass email content scanners is to create an
email which just contains a link (such as an HTML hyperlink) to the undesirable or "nasty"
content. Such content may include viruses and other varieties of malware as well as potentially
offensive material such as pornographic images and text, spam and other material to which
the email recipient may not wish to be subjected. The content scanner sees only the link,
which is not suspicious, and the email is let through. However, when the email is viewed in the
email client, the object referred to may be brought in either automatically by the email client or
when the reader clicks on the link. Thus, the nasty object ends up on the user's desktop
without ever passing through the email content scanner.

It is possible for the content scanner to download the object by following the
link itself. It can then scan the object. However, this method is not foolproof: for instance,
the server delivering the object to the content scanner may be able to detect that the request is
from a content scanner and not from the end user. It may then serve up a different, innocent
object to be scanned. However, when the end user requests the object, they get the nasty one.

Summary of the Invention 

The present invention seeks to reduce or eliminate the problems of embedded
links in electronic documents and does so by having the content scanner attempt to follow a
link found in an electronic document and scan the object which is the target of the link. If the
object is found to be acceptable from the point of view of content-scanning criteria, it is
retrieved by the scanner and embedded in the electronic document, and the link in the
electronic document is adjusted to point at the embedded object rather than the original; the
document can then be delivered to the recipient without the possibility that the version received
by the recipient differs from the one originally scanned.

If the object is not found to be acceptable, one or more remedial actions may be
taken: for example, the link may be replaced by a non-functional link and/or a notice that the
original link has been removed and why; another possibility is that the electronic document
can be quarantined and an email or alert generated and sent to the intended recipient advising
him that this has been done and perhaps including a link via which he can retrieve it
nevertheless or delete it. The process of following links, scanning the linked object and
replacing it or not with an embedded copy and an adjusted link may be applied recursively.
An upper limit may be placed on the number of recursion levels, to stop the system getting
stuck in an infinite loop (e.g. because there are circular links) and to effectively limit the
amount of time the processing will take.

Thus according to the present invention there is provided a content scanning
system for electronic documents such as emails comprising:

a) a link analyser for identifying hyperlinks in document content;

b) means for causing a content scanner to scan objects referenced by links
identified by the link analyser and to determine their acceptability according to predefined
rules, the means being operative, when the link is to an object external to the document and is
determined by the content analyser to be acceptable, to retrieve the external object and modify
the document by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded
in, or attached to, the document.

The invention also provides a method of content-scanning electronic
documents such as emails comprising:

a) using a link analyser for identifying hyperlinks in document content;

b) using a content scanner to scan objects referenced by links identified by
the link analyser and to determine their acceptability according to predefined rules, the means
being operative, when the link is to an object external to the document and is determined by
the content analyser to be acceptable, to retrieve the external object and modify the document
by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded
in, or attached to, the document.

Thus the content scanner can follow the link, and download and scan the
object. If the object is judged satisfactory, the object can then be embedded in the email, and
the link to the external object replaced by a link to the object now embedded in the email.
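
By way of illustration only, the link analyser of step a) might be sketched in Python as follows. The class name LinkAnalyser, the helper names is_external and find_external_links, and the choice of attributes and URL schemes treated as external links are assumptions made for this sketch; the specification does not prescribe them.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_external(url: str) -> bool:
    # Assumption: only http(s) URLs count as external. A cid: reference
    # already points at a part inside the email itself.
    return urlparse(url).scheme in ("http", "https")


class LinkAnalyser(HTMLParser):
    """Collects attribute values that point outside the document."""

    # Assumed set of link-carrying attributes; a real analyser may need more.
    LINK_ATTRS = {"src", "href", "background"}

    def __init__(self):
        super().__init__()
        self.external_links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        for name, value in attrs:
            if name in self.LINK_ATTRS and value and is_external(value):
                self.external_links.append(value)


def find_external_links(html_body: str) -> list:
    analyser = LinkAnalyser()
    analyser.feed(html_body)
    return analyser.external_links
```

Links that already use the cid: scheme are ignored, so a rebuilt email passes through such an analyser without its embedded objects being fetched again.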

One trick used by spammers is to embody 'web bugs' in their spam emails.
These are unique or semi-unique links to web sites, so a spammer sending out 1000 emails
would use 1000 different links. When the email is read, a connection is made to the web site,
and by finding which link has been hit, the spammer can match it with their records to tell
which person has read the spam email. This then confirms that the email address is a genuine
one. The spammer can continue to send email to that address, or perhaps even sell the address
on to other spammers.

By following every external link in every email that passes through the content
scanner, all the web bugs the spammer sends out will be activated. Their effectiveness
therefore becomes much reduced, because they can no longer be used to tell which email
addresses are valid.
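
The effect on web bugs can be illustrated with a toy model, given here as a sketch only. The addresses, the tokens and the function name confirmed_addresses are hypothetical; the model simply assumes that the spammer keeps one unique token per address and regards an address as confirmed once its token has been fetched.

```python
def confirmed_addresses(tokens: dict, hits: set) -> set:
    """Spammer's view: the addresses whose unique link has been fetched."""
    return {address for address, token in tokens.items() if token in hits}


# Hypothetical mailing: one unique web-bug token per address.
tokens = {
    "alice@example.com": "t1",
    "bob@example.com": "t2",
    "unused@example.com": "t3",
}

# Without a scanner, only a recipient who opens the email triggers a hit.
reader_hits = {"t1"}

# With a scanner that follows every external link, every token is fetched,
# whether or not anyone reads the email.
scanner_hits = set(tokens.values())

only_readers = confirmed_addresses(tokens, reader_hits)
with_scanner = confirmed_addresses(tokens, scanner_hits)
```

In the first case the hit singles out one address; in the second every address appears confirmed, so the hits no longer distinguish read from unread mail.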

The invention will be further described by way of non-limiting example with
reference to the accompanying drawings, in which:-

Figure 1 shows the "before" and "after" states of an email processed by an
embodiment of the present invention; and

Figure 2 shows a system embodying the present invention. 

Figure 1a shows an email 1 which comprises a header region 2 and a body 3
formatted according to an internet (e.g. SMTP/MIME) format. The body 3 includes a
hypertext link 4 which points to an object 5 on a web server 6 somewhere on the internet.
The object 5 may for example be a graphical image embedded in a web page (e.g. HTML or
XHTML);

Figure 1b shows the email 1 after processing by the illustrated embodiment of
the invention and it will be seen that the object 5 has been appended to the email (e.g. as a
MIME attachment) as item 5' and the link 4 has been adjusted so that it now points to this
version of the object rather than the one held on the external server 6; and

Figure 2 is an illustration of a system 10, according to the present invention,
which may be implemented as a software automaton. Although the invention is not limited to
this application, this example embodiment is given in terms of a content scanner operated by
an ISP to process an email stream e.g. passing through an email gateway.

Operation of Embodiment 

1) The email is analysed by analyser 20 to determine whether it contains 
external links. If none are found, omit steps 2 to 5. 

2) For each external link, the external object is obtained by object fetcher
30 from the internet. If the object cannot be obtained, go to step 7.

3) The external objects are scanned by analyser 40 for pornography,
viruses, spam and other undesirables. If any are found, go to step 7.

4) The external objects are analysed to see whether they contain external 
links. If the nesting limit has been reached, go to step 7. Otherwise go to step 2 for each 
external link. 

5) The email is now rebuilt by email rebuilder 50. In the case of MIME
email, the external links are replaced with internal links, and the objects obtained are added to 
the email as MIME sections. Non-MIME email is first converted to MIME email, and the 
process then continues as before. 

6) The email is sent on, and processing stops in respect of that email. 

7) An undesirable object has been found, or the object could not be
retrieved, or the nesting limit has been reached. We may wish to block the email (processing
stops), or to remove the links. We may also want to send warning messages to sender and
recipient if the email has been blocked. Meanwhile the email may be held in quarantine as
indicated at 60, which may be implemented as a reserved file directory.
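
The control flow of steps 1 to 7 can be sketched as follows, purely as an illustration. The names collect_objects, Result and MAX_NESTING, the value of the nesting limit, and the three callables passed in (standing for the link analyser 20, the object fetcher 30 and the analyser 40) are assumptions of the sketch; documents and objects are modelled as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

MAX_NESTING = 3  # assumed value; the text only says that a limit exists


@dataclass
class Result:
    accepted: bool
    reason: str = ""
    # URL -> retrieved object, ready to be embedded at step 5.
    objects: Dict[str, str] = field(default_factory=dict)


def collect_objects(content: str,
                    find_links: Callable[[str], List[str]],
                    fetch: Callable[[str], Optional[str]],
                    is_acceptable: Callable[[str], bool],
                    depth: int = 0,
                    found: Optional[Dict[str, str]] = None) -> Result:
    found = {} if found is None else found
    for url in find_links(content):          # steps 1 and 4: look for links
        if depth >= MAX_NESTING:             # step 4: nesting limit reached
            return Result(False, "nesting limit reached")
        obj = fetch(url)                     # step 2: obtain the object
        if obj is None:
            return Result(False, "could not retrieve " + url)
        if not is_acceptable(obj):           # step 3: scan the object
            return Result(False, "undesirable object at " + url)
        found[url] = obj
        nested = collect_objects(obj, find_links, fetch, is_acceptable,
                                 depth + 1, found)
        if not nested.accepted:              # a failure anywhere leads to step 7
            return nested
    return Result(True, objects=found)       # proceed to steps 5 and 6
```

A rejected result corresponds to step 7, where the caller decides whether to block, quarantine or strip the links; an accepted result carries the objects that the rebuilder would embed. Circular links are not detected explicitly in this sketch and are simply rejected once the nesting limit is reached.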

The following email contains a link to a website.



Subject: email with link
Subject:
Date: Thu, 9 May 2002 16:17:01 +0600
MIME-Version: 1.0
Content-Type: text/html;
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV>&nbsp;</DIV>
This is some text<BR>
<DIV><IMAGE src="http://www.messagelabs.com/images/global/nav/box-images/virus-eye-light.gif">
</DIV>
This is some more text<BR>
</BODY></HTML>

The binary content of "http://www.messagelabs.com/images/global/nav/box-images/virus-eye-light.gif"
is as follows:

This file can be downloaded, scanned, and if acceptable, a new email can be created with the
image embedded in the email:

Subject: email with link
Subject:
Date: Thu, 9 May 2002 16:17:01 +0600
MIME-Version: 1.0
Content-Type: multipart/related;
    boundary="ABCD";
Content-Transfer-Encoding: 7bit

--ABCD
Content-Type: text/html;
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV>&nbsp;</DIV>
This is some text<BR>
<DIV><IMAGE src=cid:EXTERNAL>
</DIV>
This is some more text<BR>
</BODY></HTML>

--ABCD
Content-ID: <EXTERNAL>
Content-Type: image/gif;
    name="image001.gif"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
    filename="image001.gif"

R01GODXhFwAXAMQAAICAgGRWBAAAAFJSU8irBP39/f/YAKqQBP/OAAshVzQrA8bGx7Cuq412BRyV
FxgUAiYnLMaxTL2+w+/LAyQdAgMLHgAqhEc8AwwKBZONcqGfl+Lh4dvh9ti6A//tAeS+HiH5BAAA
AAAALAAAAAAXABcAAAX2ICCOZGniOSKqu6mQg74uI7PoSTXNOBP/SNcOkcwg8FIpirgdcDTtIjEDw
uCAbgUM2ZSBcGtOwmHIJiAzoV+ciboelogPW+nDbpyLpXeDQtx9xDwEUbwMZCxoOU3pJHSIBAWB8
ABIFBRIDAwIJAwAYFAEEjxllEAuWlgMQmhYJG5oCFyITYBinqAWwFaOMEFOAABOEA7iWDA4JFhYV
DoJlcVMZxZYLAxW/Bx5DImwCDLgbEhmqAhgBHTIGIsMLAJkQfZ91BGgrccQZkJA6BC71LGcicIiQ
pinANewAQfNgQ4aDDFDQ+FNCA7mGNiJcCyLCo4oTHjyEAADs=

--ABCD--
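
A rebuilt email of this general shape could be produced with Python's standard email package, as in the following sketch. The function name embed_object and its parameters are hypothetical, and the sketch assumes a single GIF object and an HTML body in which the external URL appears verbatim; the library chooses its own boundary string rather than "ABCD".

```python
from email.message import EmailMessage


def embed_object(html_body: str, url: str, data: bytes,
                 cid: str = "EXTERNAL") -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = "email with link"
    # Replace the external link by an internal cid: link to the embedded copy.
    msg.set_content(html_body.replace(url, "cid:" + cid), subtype="html")
    # add_related() converts the message to multipart/related and appends the
    # object as a base64-encoded part carrying the matching Content-ID.
    msg.add_related(data, maintype="image", subtype="gif",
                    cid="<" + cid + ">", filename="image001.gif")
    return msg
```

Calling str() or bytes() on the returned message yields the serialised email, with the HTML part first and the image part second, analogous to the listing above.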



